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Abstract 

Catastrophic regime shifts in complex natural systems may be averted through advanced detection. 
Recent work has provided a proof-of-principle that many systems approaching a catastrophic transition 
may be identified through the lens of early warning indicators such as rising variance or increased return 
times. Despite widespread appreciation of the difficulties and uncertainty involved in such forecasts, 
proposed methods hardly ever characterize their expected error rates. Without the benefits of replicates, 
controls, or hindsight, applications of these approaches must quantify how reliable different indicators 
are in avoiding false alarms, and how sensitive they are to missing subtle warning signs. We propose 
a model based approach in order to quantify this trade-off between reliability and sensitivity and 
allow comparisons between different indicators. We show these error rates can be quite severe for 
common indicators even under favorable assumptions, and also illustrate how a model-based indicator 
can improve this performance. We demonstrate how the performance of an early warning indicator 
varies in different data sets, and suggest that uncertainty quantification become a more central part of 
early warning predictions. 
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1. Introduction 



There is an increasing recognition of the importance of regime shifts or critical transitions at a 



variety of scales in ecological systems (Rolling 1973 Wissel, 1984 Scheffer et al. , 2001, 2009 Drake 



and Griffen, 2010 Carpenter, 2011). Many important ecosystems may currently be threatened with 



collapse, including corals (Bellwood et al. , 2004), fisheries (Berkes et al. 2006), lakes (Carpenter 



2011), and semi-arid ecosystems (Kefi et al. , 2007). Given the potential impact of these shifts on the 



sustainable delivery of ecosystem services (Folke et al. , 2004) and the need for management to either 
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avoid an undesirable shift or else to adapt to novel conditions, it is important to develop the ability to 
predict impending regime shifts based on early warning signs. 

A number of particular systems have demonstrated the kinds of relationships that would produce 



regime shifts, including dynamics of coral reefs (Mumby et al. , 2007), and simple models of metapopu- 



lations with differing local population sizes (Hastings, 1991). In cases like these one sensible approach 
to understanding whether a regime shift would be likely would be to fit the model using either a time 
series or else independent estimates of parameters. More generally, with a good model of the system. 



detail-oriented approaches could be useful (Lade and Gross, 2012). In this treatment we focus on the 
situation where these more detailed models are not available. 

Indeed, for many ecological systems specific models are not available and general approaches are 



needed (Scheffer et al. , 2009 Lade and Gross, 2012) that do not depend on estimating the parameters of 
a known model of a specific system. This has led to a variety of approaches based on summary statistics 



{e.g. Carpenter and Brock 2006 Held 2004 Dakos et al. 2008 Guttal and Jayaprakash 2008b Biggs 



et al. , 2009 Carpenter, 2011 Seekell et al. 2011 ) that look for generic signs of impending regime shifts. 



Here we extend earlier work by providing estimates of the ability of different potential indicators to 
accurately signal impending regime shifts, and develop new approaches that both are more efficient and 
also lay bare some of the important assumptions underlying attempts to find general warning signs of 
regime shifts. We distinguish this question from the extensive literature involving change-point analysis 



for the post-hoc identification of if and when a regime shift has occurred (Easterling and Peterson 1995 



Rodionov, 2004, Lenton et al. , 2009), which is of little use if the goal is the advanced detection of the 



shift. 

We begin by discussing the limitations of current approaches that rely on summary statistics and 
provide a description of assumptions through the introduction of a model based approach to detect 
early warning signals. We then illustrate how stochastic differential equation (SDE) models can be 
used to reflect the uncertainty inherent in the detection of early warning signals. We caution against 
paradigms that are not useful for capturing uncertainty in a model-selection based approach, such as 



information criteria. Finally we use receiver-operating characteristics (Green and Swets, 1989; Keller 



et al. , 2009 ) as a way to illustrate the sensitivity different data sets and different indicators have in 



detecting early warning signals and use this to explore a number of examples. This approach provides a 
visualization of the types of errors that arise and how one can trade off between them, and is important 
for framing the problem as one focused on prediction. 
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2. The summary statistics approach 



Foundational work on early warning signals has operated under the often-implicit assumption that 
the system dynamics contain a saddle-node bifurcation by looking for patterns that are associated 
with this kind of transition. A saddle-node bifurcation occurs when a parameter changes and a stable 
equilibrium (node) and an unstable equilibrium (saddle) coalesce and dissapear. The system then 



moves to a more distant equilbrium. Guckenheimer and Holmes (1983) or any other textbook on 



dynamical systems will provide precise definitions and further explanation. 

Typical patterns used as warning signals include an increasing trend in a summary statistic such as 



variance (Carpenter and Brock, 2006), autocorrelation (Held, 2004 Dakos et al. , 2008), skew (Guttal 



and Jayaprakash , 2008b ) , spectral ratio ( Biggs et al. 2009 ) . While attractive for their simplicity, such 



approaches must confront numerous challenges. In this paper we argue for a model-based approach to 
warning signals, and describe how this can be done in a way that best addresses these difficulties. We 
begin by enumerating several of the difficulties encountered in approaches lacking an explicit model. 

Hidden assumptions 

The underlying assumption that the system contains a saddle-node bifurcation can be easily over- 
looked in common summary-statistics based approaches. For instance, variance may increase for reasons 



that do not signal an approaching transition (Schreiber, 2003; Schreiber and Rudolf 2008). Alterna- 



tively, variance may not increase as a bifurcation is approached (Livina et al. , 2012; Dakos et al 



2011b). Some classes of sudden transitions may exhibit no warning signals Hastings and Wysham 



(2010). Like saddle- node bifurcations, transcritical bifurcations involve an eigenvalue passing through 



zero, and exhibit the patterns of critical slowing down and increased variance (Drake and Griffen, 2010) 



However, transcritical bifurcations involve a change in stability of a fixed point, rather than the sudden 
dissapearance of a fixed point that has made critical transitions so worrisome. While no approach will 
be applicable to all classes of sudden transitions, it is certainly still useful to have an approach that 
detects transitions driven by saddle- node bifurcations, which have been found in many contexts {e.g., 
see 



Scheffer et al., 2001). 



Even when we can exclude or ignore other dynamics and restrict ourselves to systems that can 
produce a saddle-node bifurcation, approaches based on critical slowing down or rising variance {e.g. 



Held, 2004 Scheffer et al. , 2009 Carpenter, 2011) must further assume that a changing parameter 



has brought the system closer to the bifurcation. This assumption excludes at least three alternative 
explanations for the transition in system behavior. The first possibility is that a large perturbation of 
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the system state has moved the system into the alternative basin of attraction (Scheffer et al. , 2001). 
This is an exogenous forcing that does not arise from the system dynamics, so it is not the kind of 
event we can expect to forecast. (An example might be a sudden dramatic increase in fishing effort 
that pushes a harvested population past a threshold.) The second scenario is a purely noise-induced 



transition, a chance fluctuation that happens to carry the system across the boundary (Ditlevsen and 



Johnsen, 2010). Livina et al. (2012) indicate that such noise induced transitions cannot be predicted 
through early warning signals - at least they are not expected to exhibit the same early warning patterns 
of increased variance and increased autocorrelation anticipated in the case of a saddle-node bifurcation. 
The third scenario is that the system does pass through a saddle-node bifurcation, but rather than 
gradually and monotonically approaching the critical point, the bifurcation parameter moves in a rapid 
or highly non-linear way, making the detection of any gradual trend impossible. 

Arbitrary windows 

In addition to the assumption of a saddle-node bifurcation, the calculation of statistics that would 
be used to detect an impending transition is subject to several arbitrary choices. A basic difficulty arises 
from the need to assume a time-series is ergodic: that averaging over time is equivalent to averaging over 
replicate realizations, while trying to test if it is not. Theoretically, the increasing trend in variance, 
autocorrelation, or other statistics is something that would be measured across an ensemble - across 
replicates. As true replicates are seldom available in systems for which developing warning signals 
would be most desirable, typical methods average across a single replicate using a moving window in 
time. The selection of the size of this window and whether and by how much to overlap consecutive 



windows varies across the literature. Lenton et al. (2012) demonstrates that these differences can 



influence the results, and that the different choices each carry advantages and disadvantages. 

In addition to introducing the challenge of selecting a window size, this ergodic assumption raises 
further difficulties. While appropriate for a system that is stationary, or changing slowly enough in the 
window that it may appear stationary, the assumption is at odds with the original hypothesis that the 
system is approaching a saddle-node bifurcation. 

Further, certain statistics such as the critical slowing down measured by autocorrelation require 
data that is evenly sampled in time. Interpolating from existing data to create evenly spaced points is 
particularly problematic, as this introduces an artificial autocorrelation into the data. 
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No quantitative measures 

Summary statistics typically invoke qualitative patterns such as an increase in statistic x, rather 
than a quantitative measure of the early warning pattern. This makes it difficult to compare between 
signals or to attribute a statistical significance to the detection. Some authors have suggested Kendall's 



correlation coefficient, r, could be used to quantify an increase (Dakos et al. , 2008, 2011a) in autocorre- 



lation or variance. Other measures of increase, such as Pearson's correlation coefficient have also been 



proposed (Drake and Griffen, 2010), while most of the literature simply forgoes quantifying the increase 



or estimating significance. While adequate in experimental systems that can compare patterns between 



controls and replicates {e.g. Drake and Griffen, 2010; Carpenter, 2011), any real-world application of 
these approaches must be useful on a single time-series of observations. In these cases a quantitative 
definition of a statistically significant detection is essential. Without this, we have no assurance that 
a purported detection is not, in fact, a false positive. By focusing primarily on examples known to 
be approaching a transition when testing warning signals, the probability of false positives has largely 
been overlooked. 



Problematic null models 

Specifying an appropriate null model is also difficult. Non-parametric null hypotheses seem to 
require the fewest assumptions but in fact can be the most problematic. For instance, the standard 
non-parametric hypothesis test with Kendall's tau rank correlation coefficient assumes only that the 
two variables are independent, but this is an assumption that is violated by the very experimental 
design: temporal correlations will exist in any finely-enough sampled time series, and moving windows 
introduce temporal correlations in the statistics. Under such a test any adequately large data set 
will find a significant result, regardless of whether a warning signal exists. A similar problem arises 
when the points in the time series are reordered to create a null hypothesis - this destroys the natural 
autocorrelation in the time series. More promising parametric null models have been proposed, such as 



autoregressive models in Dakos et al. (2008), bringing us closer to a model-based approach with explicit 



assumptions. Others have looked for alternative summary statistics where reasonable null models are 



more readily available, such as Seekell et al. (2011 )'s proposal to test for conditional heteroscedasticity. 



Summary- statistic approaches have less statistical power. 



Methods for the detection of early warning signals are continually challenged by inadequate data ( In 



man 



20TI| ISchefferj [20Tol [Heidi [2004} [Dakos et al.j [20081 IScheffer et al.j [20091 IGuttal and Jayaprakash[ 



2008b: Carpenter, 2011: Bestelmeyer et al. 2011). Despite the widespread recognition of the this 
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need for large data sets, there has been very few studies quantitative studies of power to determine 



at how much data is required (Contamin and Elhson 2009), how often a particular method would 



produce a false alarm or fail to detect a signal, and which tests will be the most powerful or sensitive. 
The Neyman- Pearson Lemma demonstrates that the most powerful test between hypotheses compares 



the likelihood that the data was produced under each (Neyman and Pearson, 1933). Such likelihood 
calculations require a model-based approach. 



3. A model based approach 

Model-based approaches are beginning to play a larger role in early warning signal detection, though 
we have not as yet seen the direct fitting and simulation of models to compare hypotheses. While choos- 
ing appropriate models without system-specific knowledge is challenging, much can be accomplished 



by framing the implicit assumptions into equations. Lade and Gross (2012) introduce the idea of 



generalized models for early warning signals, and Kuehn (2011) presents normal forms for bifurcation 



processes that can give rise to critical transitions. Carpenter and Brock ( 2011| ) and Dakos et al. (2011b) 
start by assuming the dynamics obey a generic stochastic differential equation (SDE), but use this only 
to derive or define the summary statistics of interest. 

In this section we outline how the detection of early warning signals may be thought of as a 
problem of model choice. We next show generic models can be constructed under the assumptions 
discussed above and estimated from the data in a maximum likelihood framework. We highlight the 
disadvantages of comparing these estimates by information criteria, and instead introduce a simulation 



or bootstrapping approach rooted in Cox (1961) and McLachlan (1987) that characterizes the rate of 



missed detections and false alarms expected in the estimate. 

Early warning signals as model choice 

It may be useful to think of the detection of early warning signals as a problem of model choice 
rather than one of pattern recognition. The model choice approach attempts to frame each of the 
possible scenarios as structurally different equations, each with unknown parameters that must be 
estimated from the data. In any model choice problem, it is important to identify the goal of the 



exercise - such as the ability to generalize, to imitate reality, or to predict (Levins, 1966). In this case 



generality is more important than realism or predictive capability: we will write down a general model 
that is capable of approximating a wide class of models in which regime shifts are characterized by 
a saddle-node bifurcation, and a second generic model that is capable of representing the behavior of 
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such systems when they are not approaching a bifurcation. These may be thought of as the hypothesis 
and null hypothesis, though they are in fact compound hypotheses, as we must first estimate the model 
parameters from the data. In this approach it is not assumed that "reality" is included in the models 
being tested, but that one of the models is a better approximation of the true dynamics than the other. 
System whose dynamics violate the assumptions common to both models, such as in the examples of 



Hastings and Wysham (2010) where systems exhibit sudden transitions without warning, fall outside 
the set of cases where this approach would be valid; though the inability of either model to match the 
system dynamics could be an indication of such a violation. 

Models 

In the neighborhood of a bifurcation a system can be transformed into its normal form by a change 



of variables to facilitate analysis ( Guckenheimer and Holmes 1983). The normal form ( Guckenheimer 



(1) 



and Holmes, 1983 Kuehn, 2011) for the saddle-node bifurcation is 

drc 2 
— = rt — X . 
dt 

where x is the state variable and rt our bifurcation parameter. We have added a subscript t to the 
bifurcation parameter as a reminder that it is the value which may be slowly varying in time and 



consequently moving the system closer to a critical transition or regime shift (Scheffer et al. , 2009). 
Transforming this canonical form to allow for an arbitrary mean in the state variable 0, the system 
near the bifurcation looks like dx/dt = rt — {9 — x)^, with fixed point x = ^/rt + 6 =: </'(?'t)- We expand 



around the fixed point and express as a stochastic differential equation {e.g. Gardiner 2009): 



dX = ^tWn) - Xt)dt + a^/^dBt (2) 

where Bt is the standard Brownian motion. This expression captures the behavior of the system 
near the stable point as it approaches the bifurcation. Allowing the stochastic term to scale with the 
square root of (p follows from the assumption that of an internal- noise process, such as demographic 



K amp en ( 


2007 


) or 


Black and 



McKane (2012). The square root could be removed for an external noise process, such as environmental 
noise. In practice it will be difficult to descriminate between the square root and linear scaling in these 
applications, since the average value of the state changes little before the bifurcation. 

As we discuss above, in this paradigm we must include an assumption on how the bifurcation 
parameter, rt, is changing. We assume a gradual, monotonic change which we approximate to first 
order: 
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Tt = tq — mt. 



(3) 



Detecting accelerating or otherwise nonlinear approaches to the bifurcation will generally require 
more power. When the underlying system is not changing, rt is constant (m = 0) and Equation (|2]) 
will reduce to a simple Ornstein-Uhlenbeck process, 

dXt = r{e - Xt)dt + adBt (4) 
This is the continuous time analog of the first-order autoregressive model considered as a null model 



elsewhere {e.g. Dakos et al. , 2008 Guttal and Jayaprakash, 2008a) 



Likelihood calculations 

The probability P{X\M) of the data X given the model M is the product of the probability of 
observing each point in the time series given the previous point and the length of the interval. 



logP(X|M) = J^logP(x,|xi_i,ti 



(5) 



For ([2]) or Q it is sufficient ( Gardiner 2009 ) to solve the moment equations for mean and variance 
respectively: 



dt 

d^ 

dt 



E{x\M) = f{x) 

V{x\M) = -d^f{x)V{x\M) + g{x) 



(6) 
(7) 



For the OU process, we can solve this in closed form over an interval of time ti between subsequent 
observations: 



E{x.i\M = OU) = Xi_ie-''*' +e{l- e"''*') (8) 



V{x^\M = OU) = ^ (1 - (9) 



2 



2r 



For the time dependent model, we have analytic forms only for the dynamical equations of these 
moments from equation ([T]), which we must integrate numerically over each time interval. The moments 
of Equation Q are given by 
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dt 
d 

dt 



E{xi\M = LSN) = 2^/^{^/^ + e-Xi) 
V{xi\M = LSN) = -2^J^V{xi) + a2(v^ + 9) 



(10) 
(11) 



These are numerically integrated using Isoda routine available in R for the likelihood calculation. 
Comparing Models 

Likelihood methods form the basis of much of modern statistics in both Frequentist and Bayesian 
paradigms. The ability to evaluate likelihoods directly by computation has made it possible to treat 
cases that do not conform to traditional assumptions more directly. The basis of likelihood comparisons 
has its roots in the Neyman-Pearson Lemma, which essentially asserts that comparing likelihoods is 



the most powerful test of a choice between two hypotheses ( Neyman and Pearson , 1933 ) , and motivates 



tests from the simple likelihood ratio test up through modern model adequacy methods. 

The hypotheses considered here are more challenging then the original lemma provides for, as they 
are composite in nature: they specify two model forms (stable and changing stability) but with model 
parameters that must be first estimated from the data. Comparing models whose parameters have been 



estimated by maximum likelihood is first treated by Cox (1961 1962), and has since been developed 



in this simulation estimation of the null distribution (McLachlan, 1987), by parametric bootstrap 



estimate (Efron, 1987). Cox's 5 statistic (often called the deviance between models) is simply the 



difference between the log likelihoods of these maximum likelihood estimates, defined as follows. 

Let Lq be the likelihood function for model 0, let = argmax^o £ ^o? (-^^o(^o|-^)) be the maximum 
likelihood estimator for given X, and let Lq = Lo(^o|^)i and define Li, 0i, Li similarly for model 
1. The statistic we will use is 5, defined to be twice the difference in log likelihood of observing the 
data under the two MLE models, 

5 = -2(logLo-logLi). (12) 



This approach has been applied to the problem of model adequacy (Goldman, 1993) and model 



choice (Huelsenbeck and Bull, 1996) in other contexts. We have extended the approach by gener- 



ating the test distribution as well as a null distribution of the statistic 5. 

3.1. Simulation-based comparisons 

We perform the identical analysis procedure described above on each of these three data sets. First, 
we estimate parameters for the null and test model to each data set by maximum likelihood. Comparing 
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the likelihood of these fits directly gives us only a minimal indication of which model fits better. To 
identify if these differences are significant, and by what probability they could arise as a false alarm or 
a missed event, we simulate 500 replicate time series from each estimated model. 

The model parameters of both models arc rc-estimated on both families of replicates (the null and 
test, i.e. 2 X 2 X 500 fits). The differences in the likelihood values between the model estimates produced 
from the first set of simulations determines the null distribution for the deviance statistic 6. As the 
constant OU process model is nested within the time-heterogeneous model, these values are always 
positive, but tend to be not as large as those produced when the models are fit to the second family of 
data. 

The extent to which these distributions overlap indicates our inability to distinguish between these 
scenarios. The tendency of the observed deviance to fall more clearly in the domain of one distribution 
or the other indicates the probability our observed data corresponds best with that model - either 
approaching a critical transition or remaining stable. While it trivial to assign a statistical significance 
to this observation based on how far into the tail of the null distribution it falls, for the reasons we 
discussed we prefer the more symmetric comparison of the probability that this value was observed in 
either distribution. We visualize the trade-off between false alarms and failed detection using the ROC 
curves introduced above. 

Information criteria will not serve. 

One will commonly observe models representing alternative processes being compared through the 
use of various information criteria such as the Akaike information criterion. While tempting to apply 
in this situation, such approaches are not suited to this problem for several reasons. The first is 
that information criteria are not concerned with the model choice objective we have in mind, as they 
are typically applied to find an adequate model description without too many parameters that the 
system may be over- fit. More pointedly, information criteria have no inherent notion of uncertainty. 
Information criteria tests alone will not tell us our chances of a false alarm, of missing a real signal, or 
how much data we need to be confident in our ability to detect transitions. 

Beyond hypothesis testing 

It is possible to frame the question of sensitivity, reliability, and adequate data in the language 
of hypothesis testing. This introduces the need for selecting a statistical significance criterion. In 
the hypothesis testing framework, a false positive is a Type I error, which is defined relative to this 
arbitrary statistical significance criterion, most commonly 0.05. By changing the criterion, one can 
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increase or decrease the probability of the Type I error at the cost of decreasing or increasing false 
negative or Type II error, which must also be defined relative to this criterion. 

The language of hypothesis testing is built around a bias that false positives are worse than false 
negatives, and consequently an emphasis on values rather than power. In the context of early warning 
signals this is perilous - it suggests that we would rather fail to predict a catastrophe than to sound 
a false alarm. To avoid this linguistic bias and the introduction of an nuisance parameter on which to 
define statistical significance, we propose the use of receiver operating characteristic (ROC) curves. 

ROC Curves 

We illustrate the trade-off between false alarms and failed detection using receiver-operating char- 



acteristic curves first developed in signal-processing literature (Green and Swets, 1989: Keller et al. 



2009). The curves represent the corresponding false alarm rate at any detection sensitivity (true pos- 
itive rate). Fig [l| The closer these distributions are to one-another, the more severe the trade-off. 
If the distributions overlap exactly, the ROC curve has a constant slope of unity. The ROC curve 
demonstrates this trade-off between accuracy and sensitivity. Different early-warning indicators will 
vary in their sensitivity to detect differences between stable systems and those approaching a critical 
transition, making the ROC curves a natural way to compare their performance. Since the shape of 
the curve will also depend on the duration and frequency of the time-series observations, we can use 
these curves to illustrate by how much a given increase in sampling effort can decrease the rate of false 
alarms or failed detections. 

4. Example Results 

We illustrate this approach on simulated data as well as several natural time-series that have been 
previously analyzed for early warning signals. All data and code for simulations and analysis are found 
in the accompanying R package, earlywarning. 

Data 

The simulation implements an individual, continuous-time stochastic birth-death process with rates 



given by the master equation (Gardiner, 2009), 
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Test statistic (arbitrary units) 
2 6 10 2 6 10 2 6 10 2 6 10 




0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 0.0 0.4 0.8 

False Positive 

Figure 1: Top row: The distributions of a hypothetical warning indicator are shown under the case of a stable system 
(blue) and a system approaching a critical transition (red). Bottom row: Points along the ROC curve are calculated for 
each possible threshold indicated in the top row. The false positive rate is the integral of the distribution of the test 
statistic under the stable system right of the threshold (blue shaded area, corresponding to blue vertical line). The true 
positive rate is the integral of the system approaching a transition left of the threshold (red shaded area, corresponds to 
the red line). Successive columns show the threshold increasing, tracing out the ROC curve. 
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dP{n,t) 
dt 



bn-iP{n - 1, t) + dn+iP{n + 1, t) - (6„ + t) 
eKin? 



dn = en + at 



(13) 
(14) 
(15) 



where P{n, t) is the probabihty of having n individuals at time t, 6„ is the probabihty of a birth 
event occurring in a population of n individuals, dn the probability of a death, e, -ftT, /i and at are 



parameters. This corresponds to the well-studied ecosystem model of over-exploitation (Noy-Meir 



1975; May, 1977), with stochasticity introduced directly through the demographic process. We select 



this model since it is has discrete numbers of individuals, nonlinear processes, and the noise is driven 
by Poisson process of births and deaths instead of a Gaussian, and thus provides an illustration that 
our approach is robust to the violations of those assumptions in model ([2]). 

This model is forced through a bifurcation by gradually increasing the a parameter, which increases 
can be thought of as an increasing toxicity of the environment (from ao = 100 increasing at constant 
rate of 0.09 units/unit time). Other parameters are: Xo = 730, e = 0.5, K = 1000, h = 200. We run 
this model over a time interval from to 500 and sample at 40 evenly spaced time points, which were 
used for subsequent analysis. This sampling frequency was chosen to be representative of reasonable 
sampling in biological time-series, and provides enough points to detect a signal while not too many 
that errors can be avoided entirely. For the convenience of the inquisitive reader, we have also provided 
a simple function in the associated R package where the user can vary the sampling scheme and 
parameter values and rerun this analysis. This time series is shown in the top panel of Figure [2] 

The first empirical data set comes from the population dynamics of Daphnia living in the chemostat 



"H6" in the experiments of Drake & Griffen (Drake and Griffen, 2010). This individual replicate was 
chosen as an example that showed a pattern of increasing variance over the 16 data points where the 
system was being manipulated towards a crash. This time series is shown in the top panel of Figure |3j 
Our second empirical data set comes from the glaciation record seen in deuterium levels in Antarctic 



ice cores (Petit et al. , 1999), as analyzed by Dakos et al. (2008). The data are preprocessed by 
linear interpolation and de-trending by Gaussian kernel smoothing to be as consistent as possible with 
the original analysis. We focus on the third glaciation event, consisting of 121 sample points. The 



match is not exact since Dakos et al. (2008) estimates the de-trending window size manually, but the 
estimated correlations in the first-order auto-regression coefficients are in close agreement with that 
analysis. De-trending is intended to make the data consistent with the assumptions of the warning 
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Simulation timeseries data 
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Figure 2: A model-based calculation of warning signals for the simulated data example. Top panel: The original time series 
data on which model parameters for Equations Q and Q are estimated. Middle panel: replicate simulations under the 
maximum likelihood estimated (MLE) parameters of the null model, Equation Q and test model, Equation ([2|. Bottom 



panel: The distribution of deviances (differences in log likelihood. Equation (12|), when both null and test models are 
fit to each of the replicates from the null model, "null," in red, and these differences when estimating for each of the 
replicates from the test model, in blue. The overlap of distributions indicate replicates that will be difficult to tell apart. 
The observed differences in the original data are indicated by the vertical line. 
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Chemostat timeseries data 
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Figure 3: A model-based calculation of warning signals for the Daphnia data analyzed in Drake and Griffen (20101 
(Chemostat H6). Panels as in Figure [5] 
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signal detection (Dakos et al. , 2008), which did not apply to the other data sets (Drake and Griffen 



2010). This time series is shown in the top panel of Figure |4j 



Analysis 

The deviances 6 observed are 5.1, 6.0, 83.9 for the simulation, the chemostat data and the glaciation 
data, respectively. Based on AIC score each is large enough to reject the null hypothesis of a stable 
model with its one extra parameter, but this does not give the full picture of the anticipated error rates. 
The size of these differences reflects not only the magnitude of the difference in fit between the models 
but also the arbitrary units of the raw likelihoods, which are smaller for larger data-sets. Consequently 
the glaciation score refiects as much the greater length of its time series as it does anything else. 

Our simulation approach can provide a better sense of the relative trade-off in error rates associated 



with these estimates. As described above (Section 3.1), we simulate 500 replicates under each model 



shown in the middle panels of Figures H [I and m and det ermine the distributions in likelihood ratio 
under each, shown in the lower panels. The observed deviance from the original data is also indicated 
(vertical line). 

The ROC curves for each of these data sets are plotted in Figure [5j While differences in the rate at 
which the system approaches a transition will also improve the ratio of true positives to false positives, 
here we see the best-sampled data set, Glaciation, with 121 points, also has the clearest signal with 
no observed errors in the 500 replicates of each type. Comparing the chemostat and simulation curves 
illustrate how the trade-off between false positives and true positives can vary between data. The 
chemostat signal, which estimates a relatively rapid rate of change but has less data, captures a higher 
rate of true positives for a given rate of false positives than the simulation data set with a weaker rate 
of change but more data, for false positive rates above 20%. However, the simulated set with more 
data performs better if lower false-positive rates are desired. 

5. Comparing the performance of summary statistics and model-based approaches 

Due to the variety of ways in which early warning signals based on summary statistics are im- 
plemented and evaluated it is difficult to give a straight-forward comparison between them and the 
performance of this model-based approach. However, by adopting one of one of the quantitative mea- 



sures of a warning signal pattern, such as Kendall's r (Dakos et al. 2008 2011a, 2009), we are able 
to make a side-by-side comparison of the different summary statistics and the model based approach 
in the context of false alarms and failed detections shown by the ROC curve. Values of r near unity 
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Figure 4: A model-based calculation of warning signals for the Glaciation data analyzed in Dakos et al. ( 2008 1 (Glaciation 
III). Panels as in Figure [5] 
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Figure 5: ROC curves for the Simulation, Chemostat, and Glaciation data, computed from the distributions shown in 
Figures [2] |3] and |4] bottom panel. 
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Figure 6: Early warning signals in simulated and empirical data sets. The first two columns are simulated data from 
(a) a stable system (Stable), and (b) the same system approaching a saddle-node bifurcation (Deteriorating). Empirical 
examples are from (c) Daphnia magna concentrations manipulated towards a critical transition (Daphnia), and (d) 
deuterium concentrations previously cited as an early warning signal of a glaciation period (Glaciation). Increases in 
summary statistics, computed over a moving window, have often been used to indicate if a system is moving towards 
a critical transition. The increase is measured by the correlation coefficient r. Note that positive correlation does not 
guarantee the system is moving towards a transition, as seen in the stable system, first column. 



indicate a strongly increasing trend in the warning indicator, which is supposed to be indicative of an 
approaching transition. Values near zero suggest a lack of a trend, as expected in stable systems. 

Figure [6] shows the time series for each data set in columns and the early warning indicators of 
variance and autocorrelation computed over a sliding window for each. Kendall's correlation coefficient 
r is calculated for each warning indicator and displayed on the graphs, inset. For comparison, the 
left-most column includes data simulated under a stable system, which nevertheless shows a chance 
increasing autocorrelation with a r = 0.7 We can adapt the approach we have described above to 
determine how often such a strong increase would appear by chance in a stable system as follows. 

By estimating the stable and critical transition models from the data, and simulating 500 replicate 
data sets under each as in the analysis above, we can then calculate the warning signals statistic over 
a sliding window of size equal to one-half the length of the time series, and compute the correlation 
coefficient r measuring the degree to which the statistic shows an increasing trend. This results in 
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Figure 7: Box-plots of the distributions of Kendall's r observed for the summary statistic methods variance and autocor- 
relation, applied to three different data sets (from Figures [5] [s] [4| . The distributions show extensive overlap, suggesting 
that it will be difficult to distinguish early warning signals by the correlation coefficient in these summary statistics. 



a distribution of r values coming from a model of a stable system, and a corresponding distribution 
of T values coming from the model with an impending transition. These distributions are shown in 
Figure [Tj Contrary to the expectation that replicates of the null model (stable system, Equation Q) 
would cluster around zero, while the test model. Equation ([2]), would cluster around larger positive 
r values, the observed r values on the replicates extend evenly across the range. This results in 
dramatic overlap and offers little ability to distinguish between the stable replicates and the replicates 
approaching a transition. 

The use of box plots in Figure [7] provide a convenient and familiar way to visualize the overlap 
between more than two distributions, though they lack the resolution of the overlapping density dis- 
tributions in Figures [2| [3| |4j The overlapping distributions are the natural representation from which 
to introduce the ROC curve, as in Figure [TJ 

The ROC curves for these data (Fig. [s]) show that the summary-statistic based indicators frequently 
lack the sensitivity to distinguish reliably between observed patterns from a stable or unstable system. 
The large correlations observed in the empirical examples (Fig. |6]) are not uncommon in stable systems. 
It is notable that in both empirical examples the summary statistics approach does little better than 
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Figure 8: ROC curves compare the performance of the summary statistics variance and autocorrelation against the 
likehhood-based approach from Figure [s] on each of three example data sets (Figures [2||3|p|. 





Variance 


Likelihood 


Simulation 


25 % 


61% 


Chemostat 


5.0% 


34% 


Glaciation 


5.4% 


100% 



Table 1: Fraction of crashes detected when the desired false alarm rate is fixed to 5% 



chance in distinguishing replicates that have been simulated from models ([2]) and Q, despite the 
fact that these models correspond to the assumptions of the summary statistics approaches. On the 
simulated data, the variance based method approaches the true-positive rate of our likelihood method 
at higher levels of false positives, but performs worse when the desired level of false positives is low. 
The ROC curve helps us compare the performance of the different approaches at different tolerances. 
For instance, Table [l] shows the fraction of true crashes caught at a 5% false positive rate. We can 
instead set a desired True positive rate and read off the resulting number of false alarms. Table [2j 





Variance 


Likelihood 


Simulation 


49 % 


55% 


Chemostat 


81 % 


35% 


Glaciation 


93 % 


0% 



Table 2: Fraction of false alarms when the desired detection rate is fixed to 90% 
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6. Discussion 

The challenge of determining early warning signs for impending possible regime shifts requires real 
attention to the underlying statistical issues and other assumptions. Doing this, does, however, open up 
new possibilities for asking what the goal of detection should be, and for clearly identifying underlying 
assumptions. We consider alternative approaches based either on summary statistics or a likelihood 
based model choice. By assuming the underlying model corresponds to a saddle-node bifurcation, our 
analysis presents a "best-case scenario" for both summary statistic and likelihood-based approaches. 
Other literature has already begun to address the additional challenges posed when the underlying 



dynamics do not correspond to these models (Hastings and Wysham, 2010). Our results illustrate that 



even in this best-case scenario, reliable identification of warning signals from summary statistics can 
be difficult. 

We have used three examples to illustrate the performance of this approach in data from simulation, 
a chemostat experiment, and paleo-atmospheric record; examples differing in sampling intensity and 
strength of signal of an approaching collapse. While the well-sampled geological data shows an unmis- 
takable signal in this model-based approach, the uncertainty in the smaller simulated and experimental 
data forces a trade-off between errors. 

As a way to clearly illustrate the choices involved in looking for warning signals while avoiding 
false alarms, we introduce an approach based on receiver operator curves. These curves illustrate the 
extent to which an potential warning signal mitigates the trade-off between missed events and false 
alarms. The extent of the difficulty in finding reliable indicators of impending regime shifts based on 
summary statistics becomes clear from the ROC curves of these statistics, where a 5% false positive 
rate often corresponds to only a 5% true positive rate, performing no better than the flip of a coin. By 
estimating the ROC curve for a given set of data, we can better avoid applying warning signals in cases 
of inadequate power. By taking advantage of the assumptions being made to write down a specific 
likelihood function, we can develop approaches that get the most information from the data available. 

In any application of early warning signals, it is essential to address the question of model adequacy. 
Our approach formalizes the assumptions about the underlying process to match the assumptions of 
the other warning signals. As the bifurcation results from the principle eigenvalue passing through 
zero, the warning signal is expected in linear-order dynamics; estimation of the nonlinear model is 
less powerful and less accurate. The performance of this approach in the simulated data - which is 
nonlinear in its dynamics and driven with non-Gaussian noise introduced by the Poisson demographic 
events - demonstrates the accuracy under violation of these assumptions. 
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The conclusion is not simply that likelihood approaches are more reliable, but rather more broadly 
that warning signals should consider the inherent trade-off between sensitivity and accuracy, and must 
quantify how this trade-off depends on both the indicators used and the data available. The approach 
developed here estimates the risk of both failed detection and false alarms; concepts which are critical 
to prediction-based management. Using the methods we have outlined when designing early warning 
strategies for natural systems can ensure that data collection has adequate power to offer a reasonable 
chance of detection. 
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